Data Visualization Notes

This is a starter RMarkdown template to accompany Data Visualization (Princeton University Press, 2019). You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done. At the very top of the file is a section of metadata, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You should change the title, author, and date to the values that suit you. Keep the output line as it is for now, however. Each line in the metadata has a structure. First the key (“title”, “author”, etc.), then a colon, and then the value associated with the key.

This is an RMarkdown File

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any R code chunks embedded in the document. A code chunk is a specially delimited section of the file. You can add one by moving the cursor to a blank line and choosing Code > Insert Chunk from the RStudio menu. When you do, an empty chunk will appear.

Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks also have a pair of braces and the letter r, to indicate what language the chunk is written in. You write your code inside the code chunks. Write your notes and other material around them, as here.

Before you Begin

To install the tidyverse, make sure you have an Internet connection. Then manually run the code in the chunk below. If you knit the document, it will be skipped. We do this because you only need to install these packages once, not every time you run this file. Either run the chunk using the little green “play” arrow to the right of the chunk area, or copy and paste the text into the console window.

## This code will not be evaluated automatically.
## (Notice the eval = FALSE declaration in the options section of the
## code chunk)

my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
                 "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
                 "here", "interplot", "margins", "maps", "mapproj",
                 "mapdata", "MASS", "quantreg", "rlang", "scales",
                 "survey", "srvyr", "viridis", "viridisLite", "devtools")

install.packages(my_packages, repos = "http://cran.rstudio.com")
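
Because these packages only need to be installed once, it can help to check which ones are already present before installing anything. The following is a minimal sketch using base R only. The shortened package list and the names `installed` and `missing_packages` are illustrative and not part of the original template.

```r
# A shortened, illustrative list; substitute the full my_packages vector above.
my_packages <- c("tidyverse", "gapminder", "ggrepel")

# Names of every package currently installed in the library paths
installed <- rownames(installed.packages())

# Packages in the list that are not installed yet
missing_packages <- setdiff(my_packages, installed)

if (length(missing_packages) > 0) {
  message("Not yet installed: ", paste(missing_packages, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(missing_packages, repos = "http://cran.rstudio.com")
} else {
  message("All listed packages are already installed.")
}
```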

Set Up Your Project and Load Libraries

To begin we must load some libraries we will be using. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot2 and other tools. We also load the socviz and gapminder libraries.

Notice that here, the braces at the start of the code chunk have some additional options set in them. There is the language, r, as before. This is required. Then there is the word setup, which is a label for your code chunk. Labels are useful to briefly say what the chunk does. Label names must be unique (no two chunks in the same document can have the same label) and cannot contain spaces. Then, after the comma, an option is set: include=FALSE. This tells R to run this code but not to include the output in the final document.
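
Because of include=FALSE, the setup chunk itself does not appear in the knitted output. Going by the description above, its body would look roughly like the sketch below. The exact contents of the original chunk are not shown in this document, so treat this as an assumption.

```r
# Load the packages described above. Each must already be installed.
library(tidyverse)  # includes ggplot2, dplyr, and related tools
library(socviz)     # datasets used later, such as gss_sm and organdata
library(gapminder)  # the gapminder country-year dataset
```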

You can embed an R code chunk like this:

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

The remainder of this document contains the chapter headings for the book, and an empty code chunk in each section to get you started. Try knitting this document now by clicking the “Knit” button in the RStudio toolbar, or choosing File > Knit Document from the RStudio menu.

Look at Data

Get Started

p <- ggplot(data = gapminder, mapping = aes(x=gdpPercap, y=lifeExp))
p + geom_point()

Make a Plot

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

…“an ill-advised linear fit”

In the plot, data is bunched up against the left side. The x-scale would probably look better if it were converted from a linear scale to a log scale, using the function scale_x_log10().

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method="gam") + scale_x_log10()
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
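
To see why the points bunch up on a linear axis, it can help to check where the median of gdpPercap sits within its range, before and after a log transform. This is only a rough numerical sketch. The names `raw_position` and `log_position` are illustrative, and note that scale_x_log10() transforms the axis rather than the data itself.

```r
library(gapminder)

gdp <- gapminder$gdpPercap

# Where the median falls between the minimum (0) and the maximum (1).
# On the raw scale it sits very close to the minimum, which is why the
# points pile up on the left of a linear axis.
raw_position <- (median(gdp) - min(gdp)) / (max(gdp) - min(gdp))

# The same calculation after a log10 transform: the median moves much
# nearer to the middle of the range.
log_gdp <- log10(gdp)
log_position <- (median(log_gdp) - min(log_gdp)) / (max(log_gdp) - min(log_gdp))

round(c(raw = raw_position, log10 = log_position), 2)
```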

Let’s tidy up the axes.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method="gam") + scale_x_log10(labels = scales::dollar)
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

Adding some colour

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(color="purple") + geom_smooth(method="loess") + scale_x_log10(labels = scales::dollar)
## `geom_smooth()` using formula 'y ~ x'

Removing the standard error ribbon (se)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha=0.3) + geom_smooth(color="orange", se = FALSE, size=1, method="lm") + scale_x_log10(labels = scales::dollar)
## `geom_smooth()` using formula 'y ~ x'
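
In the call above, size = 1 sets the thickness of the smoothed line. In ggplot2 3.4.0 and later, line thickness is set with linewidth instead, and using size for lines gives a deprecation warning. Assuming a recent ggplot2, an equivalent sketch would be the following (the name `p_lm` is illustrative):

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# linewidth replaces size for line thickness (requires ggplot2 >= 3.4.0)
p_lm <- p + geom_point(alpha = 0.3) +
  geom_smooth(color = "orange", se = FALSE, linewidth = 1, method = "lm") +
  scale_x_log10(labels = scales::dollar)

p_lm
```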

Fix the labels

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha=0.3) + geom_smooth(method="gam") + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

Add continent information

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent))
p + geom_point(alpha=0.3) + geom_smooth(method="loess") + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ x'

Colouring the SE shading.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent, fill=continent))
p + geom_point(alpha=0.3) + geom_smooth(method="loess") + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ x'

Using a single smoother line for all continents

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping=aes(color=continent)) + geom_smooth(method="loess") + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ x'

ggsave(filename="my_figure.png")
## Saving 8 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
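
Called with only a filename, ggsave() saves the most recently displayed plot at the size of the current graphics device, which is where the "Saving 8 x 5 in image" message comes from. Passing the plot object and the dimensions explicitly makes the result less dependent on the state of the session. A sketch, where `p_out` and `out_file` are illustrative names and the file is written to a temporary directory:

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p_out <- p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "loess") +
  scale_x_log10(labels = scales::dollar)

# Write to a temporary file so nothing in the project folder is overwritten
out_file <- file.path(tempdir(), "my_figure.png")

# Name the plot and the size (in inches) rather than relying on defaults
ggsave(filename = out_file, plot = p_out, width = 8, height = 5, units = "in")

file.exists(out_file)
```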

Mapping continuous variables to colour

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping=aes(color=log(pop))) + geom_smooth(method="loess") + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ x'

Putting ‘smooth’ before ‘point’

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p  + geom_smooth(method="loess") + geom_point(mapping=aes(color=continent)) + scale_x_log10(labels = scales::dollar) + labs(x="GDP Per Capita", y="Life Expectancy in Years", title="Economic Growth and Life Expectancy", subtitle="Data points are country-years", caption="Source: Gapminder")
## `geom_smooth()` using formula 'y ~ x'

ggsave(filename="my_figure.png")
## Saving 8 x 5 in image
## `geom_smooth()` using formula 'y ~ x'

Show the Right Numbers

p <- ggplot(data=gapminder, mapping=aes(x=year, y=gdpPercap))
p + geom_line(aes(group=country))

Using Facet to make small multiples

p <-  ggplot(data=gapminder, mapping=aes(x = year, y = gdpPercap))
p + geom_line(aes(group=country)) + facet_wrap(~continent)

Arranging the facets

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color="gray70", aes(group = country)) + geom_smooth(size = 1.1, method = "loess", se = FALSE) + scale_y_log10(labels=scales::dollar) + facet_wrap(~continent, ncol = 5) + labs(x = "Year", y = "GDP per capita", title = "GDP per capita on Five Continents")
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data = gss_sm, mapping=aes(x = age, y=childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_grid( sex ~ race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

p <- ggplot(data = gss_sm, mapping=aes(x = bigregion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping=aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
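
The group = 1 setting tells geom_bar() to treat the whole dataset as one group when computing proportions. Without it, each bar is its own group and every proportion comes out as 1. Newer versions of ggplot2 (3.3.0 and later) also offer after_stat(prop) as a replacement for the ..prop.. notation. The sketch below assumes such a version and uses layer_data() to check the computed values. The name `p_prop` is illustrative.

```r
library(ggplot2)
library(socviz)

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))

# after_stat(prop) is the newer spelling of ..prop..
p_prop <- p + geom_bar(mapping = aes(y = after_stat(prop), group = 1))
p_prop

# With group = 1 the proportions are computed over the whole dataset,
# so the bar heights add up to 1.
sum(layer_data(p_prop)$prop)
```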

Colouring bar charts

p <-  ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar()

Colouring Bar Charts, and getting rid of the unnecessary legend

p <-  ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill=FALSE)
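
guides(fill = FALSE) works in the ggplot2 version used for these notes, but more recent releases deprecate FALSE in favor of the string "none". A sketch of the same plot in the newer style, with `p_no_legend` as an illustrative name:

```r
library(ggplot2)
library(socviz)

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))

# "none" suppresses the legend for the fill aesthetic only
p_no_legend <- p + geom_bar() + guides(fill = "none")
p_no_legend
```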

Creating a stacked Bar Chart

p <- ggplot(data=gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

Making the stacks the same height

p <- ggplot(data=gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position="fill")

Making new columns for each category

p <- ggplot(data=gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position="dodge")

Dodged bars with proportions, a first attempt (without a group aesthetic, every bar has height 1)

p <- ggplot(data=gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position="dodge", mapping=aes(y=..prop..))

Making new columns for each category with proportions, not counts

p <- ggplot(data=gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position="dodge", mapping=aes(y=..prop.., group=religion))

Using facet_wrap

p <- ggplot(data=gss_sm, mapping = aes(x = religion))
p + geom_bar(position="dodge", mapping=aes(y=..prop.., group=bigregion)) + facet_wrap(~bigregion, ncol=2)

Creating Histograms

p <-  ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p <-  ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins=10)
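
The message from stat_bin() suggests choosing a binwidth. Where bins sets how many bins to draw, binwidth sets how wide each bin is in the units of the x variable. The value 0.01 below is an arbitrary choice for illustration, not a recommendation, and `p_bw` is an illustrative name.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area))

# Each bin is 0.01 units of area wide
p_bw <- p + geom_histogram(binwidth = 0.01)
p_bw

# Number of bins that this binwidth produces for these data
nrow(layer_data(p_bw))
```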

Comparing two histograms

oh_wi <- c("OH", "WI")

p <- ggplot(data = subset(midwest, subset = state %in% oh_wi), mapping = aes(x=percollege, fill=state))
p + geom_histogram(alpha = 0.4, bins = 20)

Kernel density as an alternative to histograms

p <- ggplot(data = midwest, mapping = aes(x=area))
p + geom_density()

Using colour with kernel density

p <- ggplot(data = midwest, mapping = aes(x=area, fill=state, color=state))
p + geom_density(alpha = 0.3)

Creating scaled density plots

p <- ggplot(data=subset(midwest, subset = state %in% oh_wi), mapping = aes(x=area, fill=state, color=state))
p + geom_density(alpha = 0.3, mapping = (aes(y = ..scaled..)))
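
The scaled variable computed by stat_density() is the density estimate rescaled so that each group's curve peaks at 1, which makes the shapes of the two states easier to compare. The sketch below checks this with layer_data(). It uses after_stat(scaled), the newer spelling available in ggplot2 3.3.0 and later, and the names `p_scaled` and `peaks` are illustrative.

```r
library(ggplot2)

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))

# after_stat(scaled) is the newer spelling of ..scaled..
p_scaled <- p + geom_density(alpha = 0.3, mapping = aes(y = after_stat(scaled)))
p_scaled

# The tallest point of each state's curve should be 1
computed <- layer_data(p_scaled)
peaks <- tapply(computed$scaled, computed$group, max)
peaks
```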

Working with data that has already been summarized

p <-  ggplot(data=titanic, mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "top")

Using ‘position = identity’ to plot values as given.

p <- ggplot(data = oecd_sum, mapping = aes( x= year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = FALSE) + labs(x = NULL, y = "Difference in Years", title = "The US Life Expectancy Gap", subtitle = "Difference between US and OECD average life expectancies, 1960-2015", caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 17th, 2017.")
## Warning: Removed 1 rows containing missing values (position_stack).

Keynote test

# Keynote.csv is a local file; it must be in the working directory
keynote <- read.csv("Keynote.csv")
p <- ggplot(data = keynote, mapping = aes(x = year, y = amount))
p + geom_line(aes(group = country)) + facet_wrap(~ country, ncol=4)

Graph Tables, Make Labels, Add Notes

Using pipes to summarize data

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq * 100), 0))
## Warning: Factor `religion` contains implicit NA, consider using
## `forcats::fct_explicit_na`
rel_by_region
## # A tibble: 24 x 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## # … with 14 more rows
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill=religion))
p + geom_col(position = "dodge2") + labs(x= "Region", y = "Percent", fill = "Religion") + theme(legend.position = "top")
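
A quick sanity check on the summary table is to confirm that the percentages add up to roughly 100 within each region. The sketch below rebuilds rel_by_region so that it runs on its own. `region_totals` is an illustrative name, and small departures from 100 are expected because pct is rounded.

```r
library(dplyr)
library(socviz)

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq * 100), 0))

# summarize() drops the innermost grouping (religion), so the mutate() step
# computes freq within each bigregion. Each region should total about 100.
region_totals <- rel_by_region %>%
  group_by(bigregion) %>%
  summarize(total = sum(pct))

region_totals
```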

Flipping the charts

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill=religion))
p + theme(panel.background = element_rect(fill = 'white', color="black")) + geom_col(position = "dodge2") + labs(x= NULL, y = "Percent", fill = "Religion") + guides(fill = FALSE) + coord_flip() + facet_grid(~ bigregion) 

Continuous variables by group or category

organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 x 6
##    country       year       donors    pop pop_dens   gdp
##    <chr>         <date>      <dbl>  <int>    <dbl> <int>
##  1 Spain         1996-01-01   26.8  39279    7.76  16416
##  2 Ireland       1994-01-01   20.3   3590    5.11  15990
##  3 United States 1993-01-01   18.7 259919    2.70  25327
##  4 Finland       NA           NA     4986    1.47  18025
##  5 Denmark       1998-01-01   11     5304   12.3   25537
##  6 Sweden        1991-01-01   16.4   8617    1.92  19000
##  7 Germany       1991-01-01   13.3  80014   22.4   17511
##  8 France        1993-01-01   17.1  57467   10.4   19763
##  9 Canada        2001-01-01   13.5  31111    0.312 29235
## 10 Ireland       NA           NA     3514    5.00  12917
p <-  ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).

Faceting the data

p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~country, ncol = 4)
## Warning: Removed 34 row(s) containing missing values (geom_path).

Making a boxplot

p <- ggplot(data = organdata, mapping=aes(x=country, y = donors))
p + geom_boxplot()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Flipping the axes

p <- ggplot(data = organdata, mapping=aes(x=country, y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Reorder the information

p <- ggplot(data = organdata, mapping=aes(x=reorder(country, donors, na.rm = TRUE), y = donors))
p + geom_boxplot() + labs(x=NULL) + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
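
To see what reorder() is doing inside aes(), it can be run on a small vector directly. The toy data below is invented for illustration and is not taken from organdata. The names `by_mean` and `by_min` are likewise illustrative.

```r
# Toy data, invented for illustration (not values from organdata)
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  donors  = c(30, NA, 10, 40, 20, 22)
)

# reorder() sorts the levels of its first argument by a summary (the mean,
# by default) of its second. na.rm = TRUE is passed on to mean().
by_mean <- reorder(toy$country, toy$donors, na.rm = TRUE)
levels(by_mean)
# → "C" "B" "A"   (means: C = 21, B = 25, A = 30)

# A different summary can be supplied through FUN
by_min <- reorder(toy$country, toy$donors, FUN = min, na.rm = TRUE)
levels(by_min)
# → "B" "C" "A"   (minimums: B = 10, C = 20, A = 30)
```

In the boxplot, the same reordering is what places the countries from lowest to highest mean donation rate along the axis.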

Reorder the information (without labs(x = NULL), so the default axis label is shown)

p <- ggplot(data = organdata, mapping=aes(x=reorder(country, donors, na.rm = TRUE), y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Making a violin chart

p <- ggplot(data = organdata, mapping=aes(x=reorder(country, donors, na.rm = TRUE), y = donors))
p + geom_violin() + labs(x=NULL) + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_ydensity).

Adding color to the boxplot

p <- ggplot(data = organdata, mapping=aes(x=reorder(country, donors, na.rm = TRUE), y = donors, fill=world))
p + geom_boxplot() + labs(x=NULL) + coord_flip() + theme(legend.position="top")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Work with Models

Draw Maps

Refine your Plots